OCDS Advanced Track Tutorial

Tutorial Learning Goals

Welcome to the NCSCO OCDS Intro track tutorial. We are Nick Gawron and Livia Popa, we will be working with you through today’s tutorial. We will be using R studio for today’s session.

We will be tackling these objectives:

  • Define open data and reproducible science
  • Recognize the importance of statistical analysis for climate data analysis using cloud based data
  • Apply techniques to access and explore publicly available climate data from the cloud using exploratory data analysis (i.e. tidyverse)
  • Create a machine learning model to predicts results for a representative case study

Meet Mr. Wuf!

Mr. Wuf works for Mount Mitchell State Park in Burnsville, NC and was recently asked by his boss to write a report summarizing rainfall and temperature data for 2021. This report will be used to help optimize 2022 event planning (e.g., fall color viewing) for park visitors and maintenance scheduling for park staff. He recently got promoted to the head office of the NC State Climate office! Mr. Wuf’s wife, Mrs. Wuf, recently told him about the State Climate Office of North Carolina’s new nClimgrid and Econet data portals. He agrees with her that it would be a great opportunity to check out these new, free tools. After some preliminary sleuthing around nClimgrid, he discovered there was a Monthly U.S Climate Gridded Dataset (nClimGrid; https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ncdc:C00332). How did he miss this? Once he downloads these data from nClimGrid and Econet, Mr. Wuf plans to put the skills he learned in an online R programming course to the test for this real-world, work-related project.

Importing Data

API Introduction

  • API is the acronym for Application Programming Interface that enables applications to exchange data and functionality easily and securely.

  • The API data that we will be using is called nClimgrid which consists of four climate variables derived from the GHCN-Dataset: maximum temperature, minimum temperature, average temperature, and precipitation.

nClimgrid Data Set

  • Blurb on DataSet
## Read in data from AWS nClimGrid 2020-2021
nclim_2020s<-read_csv('https://noaa-nclimgrid-daily-pds.s3.amazonaws.com/EpiNOAA/decadal/2020-2021-ste-decadal.csv')%>%drop_na()
## Rows: 32160 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (3): month, state, region_code
## dbl  (6): year, day, PRCP, TAVG, TMIN, TMAX
## date (1): date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Read in nClimGrid 2010-2019
nclim_2010s<-fread('https://noaa-nclimgrid-daily-pds.s3.amazonaws.com/EpiNOAA/decadal/2010-2019-ste-decadal.csv')%>%drop_na()
#If computation time takes a while ... we have   a .Rda file 
load("NCLIM2010.Rda")
load("NCLIM2020.Rda")

Functions/Cleaning

Combining Data Sets

## Lets quickly combind the data sets to get a data set ranging from the above
## lets also filter for data record that pertain to North Carolina
nclim <- rbind(nclim_2020s,nclim_2010s)%>%filter(state == c("North Carolina", "South Carolina"))
## Using the whole data set, subset values to only look at the years 2019 to 2020
## We will be using this data set for k-means clustering later
## Call the data set nclimk 
## This data set will have 34 thousand objects

nclimk_raw<-read_csv('https://noaa-nclimgrid-daily-pds.s3.amazonaws.com/EpiNOAA/csv/202101-cty-scaled.csv')
## Rows: 96317 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): date, month, state, county, region_code
## dbl (6): year, day, PRCP, TAVG, TMIN, TMAX
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Functions

Using the formula \((C \times \frac{9}{5}) + 32=F\), covert the temperature to Fahrenheit.

#create a function for basic syntax 
# convert data value temperature F to C
# Call it CtoF
CtoF<- function(datcols){
  x<- (9/5)*datcols+32
  return(x)
}

We now will need to apply this function to the temperature columns.

#apply the fucntion  CtoF with tidyverse functions 
# to the coloumns relating to tempurature
colnames(nclim)
##  [1] "date"        "year"        "month"       "day"         "state"      
##  [6] "region_code" "PRCP"        "TAVG"        "TMIN"        "TMAX"
nclim[8:10]<- nclim[8:10]%>% lapply(CtoF)
head(nclim)
## # A tibble: 6 x 10
##   date        year month   day state         region_code  PRCP  TAVG  TMIN  TMAX
##   <date>     <dbl> <chr> <dbl> <chr>         <chr>       <dbl> <dbl> <dbl> <dbl>
## 1 2020-01-01  2020 01        1 North Caroli~ 31           0     45.1  34.4  55.8
## 2 2020-01-03  2020 01        3 North Caroli~ 31           9.65  47.9  38.1  57.7
## 3 2020-01-05  2020 01        5 North Caroli~ 31           4.39  47.9  34.8  61.0
## 4 2020-01-07  2020 01        7 North Caroli~ 31           0.56  44.4  31.2  57.6
## 5 2020-01-09  2020 01        9 North Caroli~ 31           0     43.1  29.7  56.5
## 6 2020-01-11  2020 01       11 North Caroli~ 31           2.2   54.5  44.7  64.3

Cleaning

## Apply Tidyverse piping to easily add degree labels  to the end of certain coloumns for temp
## and to add units for precipitation
colnames(nclim)[8:10]<-colnames(nclim)[8:10]%>%tolower()%>% paste0("_degf")
colnames(nclim)[7]<-colnames(nclim)[7]%>%tolower()%>% paste0("_cm")

head(nclim)
## # A tibble: 6 x 10
##   date        year month   day state     region_code prcp_cm tavg_degf tmin_degf
##   <date>     <dbl> <chr> <dbl> <chr>     <chr>         <dbl>     <dbl>     <dbl>
## 1 2020-01-01  2020 01        1 North Ca~ 31             0         45.1      34.4
## 2 2020-01-03  2020 01        3 North Ca~ 31             9.65      47.9      38.1
## 3 2020-01-05  2020 01        5 North Ca~ 31             4.39      47.9      34.8
## 4 2020-01-07  2020 01        7 North Ca~ 31             0.56      44.4      31.2
## 5 2020-01-09  2020 01        9 North Ca~ 31             0         43.1      29.7
## 6 2020-01-11  2020 01       11 North Ca~ 31             2.2       54.5      44.7
## # ... with 1 more variable: tmax_degf <dbl>
## Keep the weather data coloumns as well as the date, drop all other coloumns 
## Further Create the variable ifRain: a factor to indicate wether it rained on a certain day
## Call this final data set nclimf

nclimf <- nclim %>% select(date,prcp_cm,tavg_degf,tmin_degf,tmax_degf,state)%>%
                    mutate(ifNC=as.factor(as.integer(state=="North Carolina")))%>%
                    mutate(ifRain= as.factor(as.integer(prcp_cm>0)))%>%
                    select(-state)
                           
                           
head(nclimf)
## # A tibble: 6 x 7
##   date       prcp_cm tavg_degf tmin_degf tmax_degf ifNC  ifRain
##   <date>       <dbl>     <dbl>     <dbl>     <dbl> <fct> <fct> 
## 1 2020-01-01    0         45.1      34.4      55.8 1     0     
## 2 2020-01-03    9.65      47.9      38.1      57.7 1     1     
## 3 2020-01-05    4.39      47.9      34.8      61.0 1     1     
## 4 2020-01-07    0.56      44.4      31.2      57.6 1     1     
## 5 2020-01-09    0         43.1      29.7      56.5 1     0     
## 6 2020-01-11    2.2       54.5      44.7      64.3 1     1

Cleaning for nclimk

Below we do the same cleaning and data preparation that we did for nclim. We note the addition of the county variable. This is because we will be looking at all of the data

### below we do the same cleaning for nlcimk 
#apply the ufnction CtoF with tidyverse

state2consider="FL" 

nclimk_last <- nclimk_raw%>%
               filter(state==state2consider)%>%
               select(-c(state,month,region_code,year,day))

#applying function CtoF
nclimk_last[4:6]<- nclimk_last[4:6]%>%lapply(CtoF)

# nclim_k colnames 
colnames(nclimk_last)[4:6]<-colnames(nclimk_last)[4:6]%>%
                            tolower()%>%
                            paste0("_degf")

#colnames for percp.
colnames(nclimk_last)[3]<-colnames(nclimk_last)[3]%>%
                          tolower()%>% 
                          paste0("_cm")

head(nclimk_last)
## # A tibble: 6 x 6
##   date      county         prcp_cm tavg_degf tmin_degf tmax_degf
##   <chr>     <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 01/1/2021 Alachua County    0         72.1      62.4      81.9
## 2 01/2/2021 Alachua County    0         75.3      66.1      84.5
## 3 01/3/2021 Alachua County    3.36      70.0      62.3      77.7
## 4 01/4/2021 Alachua County    1.76      56.0      41.1      71.0
## 5 01/5/2021 Alachua County    0         52.7      38.8      66.6
## 6 01/6/2021 Alachua County    0         53.3      37.2      69.4
tail(nclimk_last)
## # A tibble: 6 x 6
##   date       county            prcp_cm tavg_degf tmin_degf tmax_degf
##   <chr>      <chr>               <dbl>     <dbl>     <dbl>     <dbl>
## 1 01/26/2021 Washington County    0.04      66.8      58.4      75.1
## 2 01/27/2021 Washington County   25.8       66.9      60.9      72.8
## 3 01/28/2021 Washington County    5.81      58.5      43.0      74.0
## 4 01/29/2021 Washington County    0         45.7      35.1      56.4
## 5 01/30/2021 Washington County    0         47.1      32.6      61.6
## 6 01/31/2021 Washington County    1.99      51.4      37.6      65.2

EDA

  • We will inspect the data through a couple graphs as well as a simple linear regression
# scatterplot of avg tempurature vs time, color the points red
## add labels
## save the plot at plotOTemp
plotOtemp<-ggplot(nclim,aes(x=date,y=tavg_degf))+geom_point(colour="red")+labs(x = "Date", y = "Average Tempurature (F)",title ="Tempurature Over Time")
plotOtemp

Machine Learning

Simple Regression

  • Text Blurb
plotOtemp+geom_smooth(method="lm",se=TRUE,col = "black")+theme_classic()
## `geom_smooth()` using formula 'y ~ x'

Time Series Analysis

  • A time series is a collection of observations of well-defined data items obtained through repeated measurements over time

  • Time series analyze a sequence of data points collected over an interval of times

  • We will now try to fit a time series model to this data rather than a regression to see if the results improve. We will create a time series forecast with ARIMA (Auto Regressive Integrated Moving Average).

  • First we begin by creating an xts object

## with the variables date and average temperature, create a time series object. 
## save as temp_ts
temp_ts<-xts(nclimf$tavg_degf,nclim$date)
  • Now we will split the data into a training and testing set. Majority of the data will be in the training set.
# train/validation split
train_date <- nclimf$date[round(nrow(temp_ts) *0.7)]
train <- temp_ts[index(temp_ts) <= train_date]
test <- temp_ts[index(temp_ts) > train_date]
  • Next we plot the time series object we created to take a closer look at the data
# plot the time series object we called temp_ts
plot(temp_ts)

Time to build the model!

# build the time series model
model_ts <- auto.arima(as.numeric(train))
model_ts
## Series: as.numeric(train) 
## ARIMA(4,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2      ar3     ar4     mean
##       0.6543  0.3227  -0.3118  0.2889  61.5179
## s.e.  0.0195  0.0227   0.0228  0.0195   2.4412
## 
## sigma^2 = 31.12:  log likelihood = -7570.51
## AIC=15153.02   AICc=15153.06   BIC=15187.75
  • Once the model has been created, we will use that to create a forecasting xts object
# plot validation data with forecast on top of plot
#forecast <- forecast(model_ts, h=length(test))
#forecast_dates <- seq(train_date, length = length(test), by="day")

#forecast_xts <- xts(forecast$fitted, order.by=forecast_dates)
  • Finally we plot the validation (testing) data and layer the forecast on top for our time series model
#plot(test, main = 'Forecast Comparison')
#lines(forecast_xts, col="blue")
#tbats_test = tbats(as.vector(temp_ts))s
#tbats_forecast = forecast(tbats_test, h=length(test))
#plot(tbats_forecast)

K-Means

Clustering in nClimGridData

Creating Functions

  • We want to create a function that can do a complicated process in as few steps as possible.

  • calculate_cluster is the name of the function we are going to create to compute k-means clusters among other information.

calculate_cluster <- function(data, k) {
  x <- data %>%
    na.omit() %>%
    scale()
  
  df <- kmeans(x, center = k) %>%
    augment(data) %>% # creates column ".cluster" with cluster label
    # makes coloumn called silhouette, uses a cluster function called silhouette
    mutate(silhouette = cluster::silhouette(as.integer(.cluster), dist(x))[, "sil_width"])
    # calculate silhouette score
  return(df)
}
nclimk<-data.frame(nclimk_last)
##plot to the data of average tempurature w.r.t date
plot_ly(x=nclimk$prcp_cm,y=nclimk$tavg_degf,z=nclimk$tmax_degf)
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
# data subset nclimk
# chnage the date to the day of the year!
# select the column of the date and temperature only
# omit any missing values 
# inspect data
nclimk_tocluster <- nclimk %>%
    select(tavg_degf,prcp_cm,tmax_degf) %>%
    na.omit()
  


head(nclimk_tocluster)
##   tavg_degf prcp_cm tmax_degf
## 1    72.140    0.00    81.932
## 2    75.344    0.00    84.524
## 3    69.998    3.36    77.720
## 4    56.030    1.76    70.952
## 5    52.700    0.00    66.632
## 6    53.258    0.00    69.350
  • Now we want to set up a data frame to determine which runs

  • Note the importance of the map function. It works in a similar fashion to lapply, but faster due to background code.

# map cluster calculations function to range of k values
#
county_cluster_data <- tibble(k = 2:12) %>%
    mutate(kclust = map(k, ~calculate_cluster(nclimk_tocluster, .x))) %>%
    unnest(cols = c(kclust))
  
head(county_cluster_data)
## # A tibble: 6 x 6
##       k tavg_degf prcp_cm tmax_degf .cluster silhouette
##   <int>     <dbl>   <dbl>     <dbl> <fct>         <dbl>
## 1     2      72.1    0         81.9 2             0.556
## 2     2      75.3    0         84.5 2             0.513
## 3     2      70.0    3.36      77.7 2             0.537
## 4     2      56.0    1.76      71.0 1             0.172
## 5     2      52.7    0         66.6 1             0.580
## 6     2      53.3    0         69.4 1             0.445
str(county_cluster_data)
## tibble [22,847 x 6] (S3: tbl_df/tbl/data.frame)
##  $ k         : int [1:22847] 2 2 2 2 2 2 2 2 2 2 ...
##  $ tavg_degf : num [1:22847] 72.1 75.3 70 56 52.7 ...
##  $ prcp_cm   : num [1:22847] 0 0 3.36 1.76 0 0 0 1.09 0 0 ...
##  $ tmax_degf : num [1:22847] 81.9 84.5 77.7 71 66.6 ...
##  $ .cluster  : Factor w/ 12 levels "1","2","3","4",..: 2 2 2 1 1 1 1 2 1 1 ...
##  $ silhouette: num [1:22847] 0.556 0.513 0.537 0.172 0.58 ...
  # calculate average silhoutte score (highest for optimal number of k clusters)
  # for each k value
  # store these values for each value of k as temp_sil_score_data
  temp_sil_score_data <- county_cluster_data %>%
    group_by(k) %>%
    summarize(avg_sil_score = mean(silhouette, na.rm = TRUE))
 
    head( temp_sil_score_data) 
## # A tibble: 6 x 2
##       k avg_sil_score
##   <int>         <dbl>
## 1     2         0.449
## 2     3         0.492
## 3     4         0.461
## 4     5         0.438
## 5     6         0.396
## 6     7         0.367
  # find maximum
  temp_optimal_sil_score_data <- temp_sil_score_data %>%
    filter(max(avg_sil_score, na.rm = TRUE) == avg_sil_score)
  
  # save optimal k
  temp_optimal_k_value <- temp_optimal_sil_score_data$k

  temp_optimal_k_value
## [1] 3
  • Note we could have also used a visual inspection
# plot the data from temp_sil_score_data and visually inspect to determine the k
# with the largest sil score
# Add a marker point to visually show the max value!
temp_sil_score_data%>%
  ggplot(mapping =aes(x=k,y=avg_sil_score))+
  geom_line()+
  geom_point()+
  scale_x_continuous(breaks= pretty_breaks())+
  geom_point(data=temp_sil_score_data%>%
             filter(max(avg_sil_score)==avg_sil_score),
             pch=22,
             size=5,
             colour="purple")+
  labs(x="Value of k",y="Mean of Silhouette Score",title="Silhouette Score for Value of k")+
  theme_light()

ggsave("Silhouette_Plot_teacher.png")
## Saving 7 x 5 in image
  # save optimal k cluster data as nclimk_optimal_cluster
nclimk_optimal_cluster <- county_cluster_data %>%
                          filter(k == temp_optimal_k_value)

head(nclimk_optimal_cluster)
## # A tibble: 6 x 6
##       k tavg_degf prcp_cm tmax_degf .cluster silhouette
##   <int>     <dbl>   <dbl>     <dbl> <fct>         <dbl>
## 1     3      72.1    0         81.9 1            0.638 
## 2     3      75.3    0         84.5 1            0.581 
## 3     3      70.0    3.36      77.7 1            0.616 
## 4     3      56.0    1.76      71.0 1            0.0253
## 5     3      52.7    0         66.6 2            0.518 
## 6     3      53.3    0         69.4 2            0.337
tail(nclimk_optimal_cluster)
## # A tibble: 6 x 6
##       k tavg_degf prcp_cm tmax_degf .cluster silhouette
##   <int>     <dbl>   <dbl>     <dbl> <fct>         <dbl>
## 1     3      66.8    0.04      75.1 1             0.644
## 2     3      66.9   25.8       72.8 3             0.612
## 3     3      58.5    5.81      74.0 1             0.292
## 4     3      45.7    0         56.4 2             0.644
## 5     3      47.1    0         61.6 2             0.666
## 6     3      51.4    1.99      65.2 2             0.560
  • Now we want to add back in the counties for our analysis
# Call this ClusterWCounty

ClusterWCounty<-cbind(nclimk_optimal_cluster,nclimk$county,nclimk$date)
colnames(ClusterWCounty)[7:8]=c("county","date")
head(ClusterWCounty)
##   k tavg_degf prcp_cm tmax_degf .cluster silhouette         county      date
## 1 3    72.140    0.00    81.932        1 0.63844098 Alachua County 01/1/2021
## 2 3    75.344    0.00    84.524        1 0.58082918 Alachua County 01/2/2021
## 3 3    69.998    3.36    77.720        1 0.61627160 Alachua County 01/3/2021
## 4 3    56.030    1.76    70.952        1 0.02527828 Alachua County 01/4/2021
## 5 3    52.700    0.00    66.632        2 0.51783266 Alachua County 01/5/2021
## 6 3    53.258    0.00    69.350        2 0.33713965 Alachua County 01/6/2021
  • We have a function to compute the mode of a data set. I.e. This will help us compute the determine the cluster every county most often belongs too.
#function for the mode
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
#countyPlot<-ClusterWCounty%>%group_by(county)%>% 
#                summarise(Common_Cluster =mean(as.numeric((.cluster))))%>%
#                mutate(fips=fips(state=state2consider,county = county))%>%
#                select(-county)

## if i was to compute the mode
countyPlot<-ClusterWCounty%>%group_by(county)%>% 
                summarise(Common_Cluster =getmode(.cluster))%>%
                mutate(fips=fips(state=state2consider,county = county))%>%
               select(-county)

 countyPlot<- data.frame(countyPlot)
 head(countyPlot)
##   Common_Cluster  fips
## 1              2 12001
## 2              2 12003
## 3              2 12005
## 4              2 12007
## 5              1 12009
## 6              1 12011
plot_usmap( data = countyPlot, values = "Common_Cluster", "counties", include = c(state2consider), color="black")+labs(title=state2consider)+  
  #can comment out the below line if we use the mode method. 
# scale_fill_continuous(low = "#FFCC00", high = "#CC0000", name="Common_Cluster", label=scales::comma)+
    theme(legend.position="right") 

#3d plot for all of our clusters
plot_ly(x=nclimk$prcp_cm,y=nclimk$tavg_degf,z=nclimk$tmax_degf,color=ClusterWCounty$.cluster)
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Bonus (for time) Content!

PCA from ECOnet Data

cardinal <- read_csv("cardinal_data.csv", 
     col_types = list(`Average Air Temperature (F)` = col_number(), 
         `Maximum Air Temperature (F)` = col_number(), 
         `Minimum Air Temperature (F)` = col_number(), 
         `Average Experimental Leaf Wetness (mV)` = col_number(), 
         `Total Precipitation (in)` = col_number(), 
         `Average Relative Humidity (%)` = col_number(), 
         `Average Soil Moisture (m3/m3)` = col_number(), 
         `Average Soil Temperature (F)` = col_number(), 
         `Average Solar Radiation (W/m2)` = col_number(), 
         `Average Station Pressure (mb)` = col_number()))
## Warning: One or more parsing issues, see `problems()` for details
cardinal<-drop_na(cardinal)
str(cardinal)
## tibble [729 x 11] (S3: tbl_df/tbl/data.frame)
##  $ Date                                  : chr [1:729] "1/1/20" "1/2/20" "1/3/20" "1/4/20" ...
##  $ Average Air Temperature (F)           : num [1:729] 43.1 44.9 52.8 57.2 42.1 44.1 41.4 42.5 40.4 52 ...
##  $ Maximum Air Temperature (F)           : num [1:729] 53.6 55.4 64.9 65.1 50.5 58.5 52 57.6 50.5 65.8 ...
##  $ Minimum Air Temperature (F)           : num [1:729] 35.1 35.2 45.7 42.6 34.9 32 31.3 29.7 31.3 38.1 ...
##  $ Average Experimental Leaf Wetness (mV): num [1:729] 266 274 362 373 265 ...
##  $ Total Precipitation (in)              : num [1:729] 0 0.05 0.95 0.52 0 0 0.07 0 0 0 ...
##  $ Average Relative Humidity (%)         : num [1:729] 63.8 72 92.1 83.5 57 ...
##  $ Average Soil Moisture (m3/m3)         : num [1:729] 0.28 0.28 0.29 0.35 0.33 0.31 0.3 0.3 0.3 0.29 ...
##  $ Average Soil Temperature (F)          : num [1:729] 48.6 47.6 51 54.6 48.3 46.1 44.6 43.3 43.3 46.1 ...
##  $ Average Solar Radiation (W/m2)        : num [1:729] 134.8 66 31.1 44.9 135.4 ...
##  $ Average Station Pressure (mb)         : num [1:729] 999 1003 998 993 1005 ...
cardinal$Date<-as.Date(cardinal$Date, tryFormats= c("%m/%d/%y"))
view(cardinal)

#changes col names
colnames(cardinal)=c("date","AvgT","MaxT","MinT","AvgLw","Tprep","AvgHum","AvgSm","AvgSt","AvgSr","AvgStp")



cardinal$IfRain<- (cardinal$Tprep>0)
cardinal$IfRain<-as.factor(as.integer(cardinal$IfRain))

Basic Plotting with GGplot

ggplot(cardinal,aes(x=date,y=AvgT))+geom_line()+labs(title="Total Daily Rainfall by Date",y="Average Tempurature (F) ", x= "Date")

  • EDA is how we can motivate future ML models!

  • We can use forecasting to extend this trend!

PCA to cluster rain variable

  • using cardinal data to observe if there is clustering

  • used for future models

  • helps us describe higher dimensional data with less

Three general steps:

  1. Remove heavily correlated columns! - Min Temp and Max Temp for a certain day will correlate with one another!

  2. Center Data

Observe:

library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(cardinal[,-c(1,12)]))

  • Tells us to remove all but one temperature variable
IfRainVar<- cardinal$IfRain
cardshort <- cardinal%>%select(-c(date,IfRain,Tprep,MinT,MaxT))
cardshort
## # A tibble: 729 x 7
##     AvgT AvgLw AvgHum AvgSm AvgSt AvgSr AvgStp
##    <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1  43.1  266.   63.8  0.28  48.6 135.    999.
##  2  44.9  274.   72.0  0.28  47.6  66.0  1003.
##  3  52.8  362.   92.1  0.29  51    31.1   998.
##  4  57.2  373    83.5  0.35  54.6  44.9   993.
##  5  42.1  265.   57.0  0.33  48.3 135.   1005.
##  6  44.1  265.   57.6  0.31  46.1 138.   1005.
##  7  41.4  274.   75.2  0.3   44.6  40.9  1002.
##  8  42.5  314.   58.9  0.3   43.3 136.   1010.
##  9  40.4  265.   60.2  0.3   43.3 122.   1022.
## 10  52    266.   73.5  0.29  46.1  74.6  1019.
## # ... with 719 more rows
  • We will now actually conduct PCA on our data
pca_card<- princomp(scale(cardshort,scale=FALSE),cor = FALSE)
plot(pca_card$scores, pch = 16, col =IfRainVar)
legend("topright",c("No Rain","Rain"),pch=16,col=c("black","red"))

  • Here we can look at how good PCA does at describing changes in data

  • We see 2 components describes 96% of the data’s variation ! (This is very good)

summary(pca_card)
## Importance of components:
##                            Comp.1     Comp.2      Comp.3     Comp.4      Comp.5
## Standard deviation     93.1434884 45.4290622 18.05966455 6.31468558 5.494486034
## Proportion of Variance  0.7784786  0.1851865  0.02926584 0.00357804 0.002708918
## Cumulative Proportion   0.7784786  0.9636651  0.99293094 0.99650898 0.999217895
##                              Comp.6       Comp.7
## Standard deviation     2.9520630681 3.804718e-02
## Proportion of Variance 0.0007819752 1.298933e-07
## Cumulative Proportion  0.9999998701 1.000000e+00
screeplot(pca_card, type = "lines")

Closing & Resources

  • Define open data and reproducible science
  • Recognize the importance of statistical analysis for climate data analysis using cloud based data
  • Apply techniques to access and explore publicly available climate data from the cloud using exploratory data analysis
  • Create a machine learning model to predicts results for a representative case study

Any Questions?

Resouces

  • For a A Hydrologists Guide to Open Science, from the HESS Journal, click here.

Survey

Thank you for attending the beginner track tutorial at the Open Climate Data Science Workshop. Please complete the voluntary feedback survey, linked here. The main purposes of this survey is to (1) determine we met specified teaching goals and (2) improve teaching materials for subsequent tutorial sessions.